2/23/23
Q: I’m curious about how different groups will approach this dataset differently, and hope that we can have some kind of showcase after this case study is completed!
A: I am too - sometimes students add another (related) question. Some find a new related dataset. Some go really deep on visualization. I like the idea of a showcase! Was not in my plan, but I think it should be!
Q: I’m confused about…how to choose the best EDA.
A: The best EDA is the EDA that best helps you understand the datasets being used for analysis. So, while there’s not “one plot to rule them all,” a great EDA uses visualizations and data summaries that let the analyst really understand the data. Two EDAs can make different decisions and both be “best”.
Due Dates:
Notes:
Lingo:
In our case we have:
Panel is balanced: \(n=44∗31\), thus \(n = 1364\)
\[Y_{it}=β_{0}+β_{1}X_{1it}+...+β_{K}X_{Kit}+e_{it}\]
Notes:
❓ Which are examples of variables in our analysis that likely fall into each of these categories?
In our data…some of the unobserved qualities about the different states may be correlated with some of our independent variables (i.e.level of economic opportunity might be an unobserved feature about the states that influences violent crime rate (outcome) and would be possibly correlated with poverty rate and unemployment (predictors))
These individual \(a_i\) effects can be correlated with the independent variables \(X\):
\[Y_{it}=\beta_{0}+\beta_{1}X_{1it}+...\beta_{K}X_{Kit}+ a_i +e_{it}\] …or alternatively the individual effects can be absorbed into an individual-specific intercept term \(\beta_{0i}\):
\[Y_{it}=\beta_{0i}+\beta_{1}X_{1it}+...\beta_{k}X_{kit} +e_{it}\]
plm)pdata.frame (panel data frame); can indicate:
STATE)YEAR)index parameter: index = c("Individual_Variable_NAME", "Time_Period_Variable_NAME")pdata.frame[1] "pdata.frame" "data.frame"
YEAR STATE Black_Male_15_to_19_years Black_Male_20_to_39_years
Alaska-1980 1980 Alaska 0.1670456 0.9933775
Alaska-1981 1981 Alaska 0.1732299 1.0028219
Alaska-1982 1982 Alaska 0.1737069 1.0204445
Other_Male_15_to_19_years Other_Male_20_to_39_years
Alaska-1980 1.129782 2.963329
Alaska-1981 1.124441 2.974775
Alaska-1982 1.069821 3.015071
White_Male_15_to_19_years White_Male_20_to_39_years
Alaska-1980 3.627805 18.28852
Alaska-1981 3.558261 18.12821
Alaska-1982 3.391844 18.10666
Unemployment_rate Poverty_rate Viol_crime_count Population
Alaska-1980 9.6 9.6 1919 404680
Alaska-1981 9.4 9.0 2537 418519
Alaska-1982 9.9 10.6 2732 449608
police_per_100k_lag RTC_LAW_YEAR RTC_LAW TIME_0 TIME_INF
Alaska-1980 194.7218 1995 FALSE 1980 2010
Alaska-1981 200.2299 1995 FALSE 1980 2010
Alaska-1982 191.0553 1995 FALSE 1980 2010
Viol_crime_rate_1k Viol_crime_rate_1k_log Population_log
Alaska-1980 4.742018 1.556463 12.91085
Alaska-1981 6.061851 1.802015 12.94448
Alaska-1982 6.076404 1.804413 13.01613
effectThere are three main options for the effect argument:
STATE identity and time on violent crime rate (effect = "twoways")
modelinterested in how violence in each state varied over time: within STATE variation (model = within)
plm (DONOHUE)DONOHUE_OUTPUT <- plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
effect = "twoways",
model = "within",
data = d_panel_DONOHUE)❓ What is our outcome variable here? What is our primary predictor of interest? Why are all the other variables included?
# A tibble: 11 × 7
term estimate std.e…¹ stati…² p.value conf.low conf.h…³
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RTC_LAWTRUE 2.40e-2 1.70e-2 1.42 1.56e- 1 -9.19e-3 5.73e-2
2 White_Male_15_to_19_years 1.04e-2 2.82e-2 0.367 7.13e- 1 -4.49e-2 6.56e-2
3 White_Male_20_to_39_years 2.93e-2 1.00e-2 2.93 3.50e- 3 9.68e-3 4.90e-2
4 Black_Male_15_to_19_years -5.97e-2 5.78e-2 -1.03 3.02e- 1 -1.73e-1 5.36e-2
5 Black_Male_20_to_39_years 1.23e-1 1.95e-2 6.34 3.17e-10 8.53e-2 1.62e-1
6 Other_Male_15_to_19_years 6.74e-1 1.14e-1 5.92 4.15e- 9 4.51e-1 8.97e-1
7 Other_Male_20_to_39_years -3.04e-1 3.83e-2 -7.95 4.21e-15 -3.79e-1 -2.29e-1
8 Unemployment_rate -1.71e-2 4.98e-3 -3.42 6.36e- 4 -2.68e-2 -7.29e-3
9 Poverty_rate -7.60e-3 2.99e-3 -2.54 1.12e- 2 -1.35e-2 -1.74e-3
10 Population_log -2.11e-1 6.17e-2 -3.42 6.55e- 4 -3.32e-1 -8.99e-2
11 police_per_100k_lag 5.59e-4 1.40e-4 4.00 6.72e- 5 2.85e-4 8.33e-4
# … with abbreviated variable names ¹std.error, ²statistic, ³conf.high
as.formula() (LOTT)LOTT_variables <- LOTT_DF |>
select(RTC_LAW,
contains(c("White", "Black", "Other")),
Unemployment_rate,
Poverty_rate,
Population_log,
police_per_100k_lag) |>
colnames()
LOTT_fmla <- as.formula(paste("Viol_crime_rate_1k_log ~", paste(LOTT_variables, collapse = " + ")))
LOTT_fmlaViol_crime_rate_1k_log ~ RTC_LAW + White_Female_10_to_19_years +
White_Female_20_to_29_years + White_Female_30_to_39_years +
White_Female_40_to_49_years + White_Female_50_to_64_years +
White_Female_65_years_and_over + White_Male_10_to_19_years +
White_Male_20_to_29_years + White_Male_30_to_39_years + White_Male_40_to_49_years +
White_Male_50_to_64_years + White_Male_65_years_and_over +
Black_Female_10_to_19_years + Black_Female_20_to_29_years +
Black_Female_30_to_39_years + Black_Female_40_to_49_years +
Black_Female_50_to_64_years + Black_Female_65_years_and_over +
Black_Male_10_to_19_years + Black_Male_20_to_29_years + Black_Male_30_to_39_years +
Black_Male_40_to_49_years + Black_Male_50_to_64_years + Black_Male_65_years_and_over +
Other_Female_10_to_19_years + Other_Female_20_to_29_years +
Other_Female_30_to_39_years + Other_Female_40_to_49_years +
Other_Female_50_to_64_years + Other_Female_65_years_and_over +
Other_Male_10_to_19_years + Other_Male_20_to_29_years + Other_Male_30_to_39_years +
Other_Male_40_to_49_years + Other_Male_50_to_64_years + Other_Male_65_years_and_over +
Unemployment_rate + Poverty_rate + Population_log + police_per_100k_lag
plm (LOTT)# A tibble: 41 × 7
term estimate std.e…¹ stati…² p.value conf.…³ conf.…⁴
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RTC_LAWTRUE -0.0518 0.0162 -3.20 1.39e- 3 -0.0835 -0.0201
2 White_Female_10_to_19_years 0.636 0.149 4.26 2.24e- 5 0.343 0.929
3 White_Female_20_to_29_years 0.00698 0.0670 0.104 9.17e- 1 -0.124 0.138
4 White_Female_30_to_39_years 0.261 0.0813 3.21 1.38e- 3 0.101 0.420
5 White_Female_40_to_49_years 0.0168 0.0814 0.206 8.37e- 1 -0.143 0.176
6 White_Female_50_to_64_years -0.459 0.0625 -7.35 3.60e-13 -0.582 -0.337
7 White_Female_65_years_and_… 0.156 0.0469 3.33 9.04e- 4 0.0641 0.248
8 White_Male_10_to_19_years -0.583 0.143 -4.07 4.92e- 5 -0.863 -0.302
9 White_Male_20_to_29_years 0.0639 0.0623 1.03 3.05e- 1 -0.0582 0.186
10 White_Male_30_to_39_years -0.200 0.0859 -2.33 2.01e- 2 -0.368 -0.0315
# … with 31 more rows, and abbreviated variable names ¹std.error, ²statistic,
# ³conf.low, ⁴conf.high
comparing_analyses <- DONOHUE_OUTPUT_TIDY |>
bind_rows(LOTT_OUTPUT_TIDY) |>
filter(term == "RTC_LAWTRUE")
comparing_analyses# A tibble: 2 × 8
term estimate std.error statistic p.value conf.low conf.high Analysis
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 RTC_LAWTRUE 0.0240 0.0170 1.42 0.156 -0.00919 0.0573 Donohue
2 RTC_LAWTRUE -0.0518 0.0162 -3.20 0.00139 -0.0835 -0.0201 Lott
🧠 With each analysis, what would your conclusion be to the question “What is the relationship between right to carry laws and violence rates in the US”?
ggplot(comparing_analyses) +
geom_point(aes(x = Analysis, y = estimate)) +
geom_errorbar(aes(x = Analysis, ymin = conf.low, ymax = conf.high), width = 0.25) +
geom_hline(yintercept = 0, color = "red") +
scale_y_continuous(
breaks = seq(-0.2, 0.2, by = 0.05),
labels = seq(-0.2, 0.2, by = 0.05),
limits = c(-0.2, 0.2)
) +
geom_segment(aes(x = 1, y = 0.125, xend = 1, yend = 0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2, color = "green", lineend = "butt", linejoin = "mitre"
) +
geom_segment(aes(x = 2, y = -0.125, xend = 2, yend = -0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2, color = "red", lineend = "butt", linejoin = "mitre"
) +
theme_minimal() +
theme(
axis.title.x = element_blank(),
axis.text = element_text(size = 8, color = "black")
) +
labs(
title = "Effect estimate on ln(violent crimes per 100,000 people)",
y = " Effect estimate (95% CI)"
)…so how did this occur?
…not much correlation for non-demographic variables - unemployment and poverty rate show some (as expeted)
…so how do you find out for sure?
RTC_LAWrsample::loo_cv# Leave-one-out cross-validation
# A tibble: 1,364 × 2
splits id
<list> <chr>
1 <split [1363/1]> Resample1
2 <split [1363/1]> Resample2
3 <split [1363/1]> Resample3
4 <split [1363/1]> Resample4
5 <split [1363/1]> Resample5
6 <split [1363/1]> Resample6
7 <split [1363/1]> Resample7
8 <split [1363/1]> Resample8
9 <split [1363/1]> Resample9
10 <split [1363/1]> Resample10
# … with 1,354 more rows
Rows: 1,363
Columns: 20
$ YEAR <fct> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 19…
$ STATE <fct> Alaska, Alaska, Alaska, Alaska, Alaska, Alas…
$ Black_Male_15_to_19_years <pseries> 0.1670456, 0.1732299, 0.1737069, 0.17095…
$ Black_Male_20_to_39_years <pseries> 0.9933775, 1.0028219, 1.0204445, 1.03127…
$ Other_Male_15_to_19_years <pseries> 1.1297816, 1.1244412, 1.0698208, 0.98828…
$ Other_Male_20_to_39_years <pseries> 2.963329, 2.974775, 3.015071, 3.008048, …
$ White_Male_15_to_19_years <pseries> 3.627805, 3.558261, 3.391844, 3.222002, …
$ White_Male_20_to_39_years <pseries> 18.28852, 18.12821, 18.10666, 17.90600, …
$ Unemployment_rate <pseries> 9.6, 9.4, 9.9, 9.9, 9.8, 9.7, 10.9, 10.3…
$ Poverty_rate <pseries> 9.6, 9.0, 10.6, 12.6, 9.6, 8.7, 11.4, 12…
$ Viol_crime_count <pseries> 1919, 2537, 2732, 2940, 3108, 3031, 3046…
$ Population <pseries> 404680, 418519, 449608, 488423, 513697, …
$ police_per_100k_lag <pseries> 194.7218, 200.2299, 191.0553, 364.2335, …
$ RTC_LAW_YEAR <pseries> 1995, 1995, 1995, 1995, 1995, 1995, 1995…
$ RTC_LAW <pseries> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ TIME_0 <pseries> 1980, 1980, 1980, 1980, 1980, 1980, 1980…
$ TIME_INF <pseries> 2010, 2010, 2010, 2010, 2010, 2010, 2010…
$ Viol_crime_rate_1k <pseries> 4.742018, 6.061851, 6.076404, 6.019373, …
$ Viol_crime_rate_1k_log <pseries> 1.556463, 1.802015, 1.804413, 1.794983, …
$ Population_log <pseries> 12.91085, 12.94448, 13.01613, 13.09894, …
fit_nls_on_bootstrap_DONOHUE <- function(subset) {
plm(Viol_crime_rate_1k_log ~ RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = data.frame(subset),
index = c("STATE", "YEAR"),
model = "within",
effect = "twoways")
}subsets_models_DONOHUE <- map(DONOHUE_subsets, fit_nls_on_bootstrap_DONOHUE)
subsets_models_DONOHUE <- subsets_models_DONOHUE |>
map(tidy)
subsets_models_DONOHUE[[1]]# A tibble: 11 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 RTC_LAWTRUE 0.0237 0.0170 1.40 1.62e- 1
2 White_Male_15_to_19_years 0.0108 0.0282 0.382 7.02e- 1
3 White_Male_20_to_39_years 0.0294 0.0100 2.93 3.42e- 3
4 Black_Male_15_to_19_years -0.0596 0.0578 -1.03 3.02e- 1
5 Black_Male_20_to_39_years 0.123 0.0195 6.34 3.12e-10
6 Other_Male_15_to_19_years 0.676 0.114 5.94 3.68e- 9
7 Other_Male_20_to_39_years -0.304 0.0383 -7.94 4.43e-15
8 Unemployment_rate -0.0171 0.00498 -3.43 6.14e- 4
9 Poverty_rate -0.00759 0.00299 -2.53 1.14e- 2
10 Population_log -0.211 0.0617 -3.42 6.43e- 4
11 police_per_100k_lag 0.000556 0.000140 3.98 7.32e- 5
Note: This code takes a while to run. Suggestion to cache this code chunk!
Note: This code takes a while to run. Suggestion to cache this code chunk!
simulations_DONOHUE <- subsets_models_DONOHUE |>
bind_rows(.id = "ID") |>
mutate(Analysis = "Donohue")
simulations_LOTT <- subsets_models_LOTT |>
bind_rows(.id = "ID") |>
mutate(Analysis = "Lott")
simulations <- bind_rows(simulations_DONOHUE, simulations_LOTT)
# order for easier comparison
simulations <- simulations |>
mutate(term = factor(term,
levels = c(
str_subset(unique(pull(simulations, term)), "years", negate = TRUE),
sort(str_subset(unique(pull(simulations, term)), "years")))))# A tibble: 6 × 7
ID term estimate std.er…¹ stati…² p.value Analy…³
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <chr>
1 DONOHUE_1 RTC_LAWTRUE 0.0237 0.0170 1.40 1.62e- 1 Donohue
2 DONOHUE_1 White_Male_15_to_19_years 0.0108 0.0282 0.382 7.02e- 1 Donohue
3 DONOHUE_1 White_Male_20_to_39_years 0.0294 0.0100 2.93 3.42e- 3 Donohue
4 DONOHUE_1 Black_Male_15_to_19_years -0.0596 0.0578 -1.03 3.02e- 1 Donohue
5 DONOHUE_1 Black_Male_20_to_39_years 0.123 0.0195 6.34 3.12e-10 Donohue
6 DONOHUE_1 Other_Male_15_to_19_years 0.676 0.114 5.94 3.68e- 9 Donohue
# … with abbreviated variable names ¹std.error, ²statistic, ³Analysis
simulations |>
ggplot(aes(x = term, y = estimate)) +
geom_boxplot() +
facet_grid(. ~ Analysis, scale = "free_x", space = "free", drop = TRUE) +
labs(title = "Coefficient estimates",
subtitle = "Estimates across leave-one-out analyses",
x = "Term",
y = "Coefficient",
caption = "Results from simulations") +
theme_linedraw() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 70, hjust = 1),
strip.text.x = element_text(size = 14, face = "bold"),
plot.title.position="plot")coeff_sd <- simulations |>
group_by(Analysis, term) |>
summarize("SD" = sd(estimate))
coeff_sd |>
ggplot(aes(x = Analysis, y = SD)) +
geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
labs(title = "Coefficient variability",
subtitle = "SDs of coefficient estimates from leave-one-out analysis",
x = "Term",
y = "Coefficient Estimate \n Standard Deviations",
caption = "Results from simulations") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(size = 8, color = "black"),
axis.text.y = element_text(color = "black"),
plot.title.position="plot")Variance Inflation Factor (VIF): index that measures how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity.
If our model is : \(Y = β_0 + β_1X_1 + β_2X_2 + β_3X_3 + e\), then:
To calculate the VIF value for \(X_1\), perform another OLS model, where \(X_1\) is now the dependent variable explained by the other explanatory variables:
\[X_1 = β_0 + β_2X_2 + β_3X_3 + e\]
From this model : \[VIF = \frac{1}{1-R^{2}}\]
…where \(R^2\) value is the proportion of variance in \(X_1\) explained by the other variables ( \(X_2\) and \(X_3\))
Repeat for each explanatory variable in the model.
. . . - the larger the VIF, the more multicollinearity - VIF of >=10 typically indicates problematic multicollinearity
# create model matrix
lm_DONOHUE_data <- as.data.frame(model.matrix(DONOHUE_OUTPUT))
# define model
lm_DONOHUE_data <- lm_DONOHUE_data |>
mutate(Viol_crime_rate_1k_log = plm::Within(pull(
d_panel_DONOHUE, Viol_crime_rate_1k_log
)), effect = "twoways")
# specify model
lm_DONOHUE <- lm(Viol_crime_rate_1k_log ~
RTC_LAWTRUE +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = lm_DONOHUE_data
)
# calculate VIF
vif_DONOHUE <- vif(lm_DONOHUE)# combine into nice table
vif_DONOHUE <- vif_DONOHUE |>
as_tibble() |>
cbind(names(vif_DONOHUE)) |>
as_tibble()
colnames(vif_DONOHUE) <- c("VIF", "Variable")
vif_DONOHUE |>
arrange(desc(VIF))# A tibble: 11 × 2
VIF Variable
<dbl> <chr>
1 1.72 White_Male_20_to_39_years
2 1.66 Black_Male_20_to_39_years
3 1.58 Other_Male_15_to_19_years
4 1.52 Other_Male_20_to_39_years
5 1.34 Black_Male_15_to_19_years
6 1.27 Poverty_rate
7 1.23 Unemployment_rate
8 1.21 police_per_100k_lag
9 1.17 Population_log
10 1.15 White_Male_15_to_19_years
11 1.11 RTC_LAWTRUE
lm_LOTT_data <- as.data.frame(model.matrix(LOTT_OUTPUT))
lm_LOTT_data <- lm_LOTT_data |>
mutate(Viol_crime_rate_1k_log = plm::Within(pull(d_panel_LOTT, Viol_crime_rate_1k_log), effect = "twoways")) |>
rename(RTC_LAW = RTC_LAWTRUE)
lm_LOTT <- lm(LOTT_fmla, data = lm_LOTT_data)
vif_LOTT <- vif(lm_LOTT)
vif_LOTT <- vif_LOTT |>
as_tibble() |>
cbind(names(vif_LOTT)) |>
as_tibble()
colnames(vif_LOTT) <- c("VIF", "Variable")# clean up names
vif_LOTT |>
mutate(Variable = str_replace(string = Variable,
pattern = "RTC_LAW",
replacement = "RTC_LAWTRUE")) |>
arrange(desc(VIF))# A tibble: 41 × 2
VIF Variable
<dbl> <chr>
1 342. Black_Female_10_to_19_years
2 327. Black_Male_10_to_19_years
3 251. Other_Male_40_to_49_years
4 227. Other_Female_40_to_49_years
5 177. Other_Male_50_to_64_years
6 157. Other_Male_10_to_19_years
7 148. Other_Female_10_to_19_years
8 134. Other_Female_50_to_64_years
9 121. White_Female_10_to_19_years
10 120. White_Male_10_to_19_years
# … with 31 more rows
vif_df |>
ggplot(aes(x = Analysis, y = VIF)) +
geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
geom_hline(yintercept = 10, color = "red") +
geom_text(aes(.75, 13, label = "typical cutoff of 10")) +
coord_trans(y = "log10") +
labs(title = "Variance inflation factors") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(color = "black"),
axis.text.y = element_text(color = "black"))